In this project, I focused on exploring the data from a business intelligence standpoint.
In the first part, I analyzed the TV, App, and Website datasets separately and summarized my findings in tables and plots. I also explored the connections among the three datasets to see what actionable insights we can gain. Each analysis begins with a question, continues with the approach and data visualizations, and ends with my conclusions and recommendations.
In the second part, I conducted audience segmentation using the TV data, considered how the segmentation could be improved if more data were available, and gave marketing suggestions.
Lastly, I listed what I would like to do next in order to explore the data further.
The TV dataset contains audience behavior information, including view time, TV programs, and network, for each user in the week of 2017-01-02. It originally contains 1062961 observations and 30 variables. After data cleaning and manipulation, such as removing unreasonable records, converting data types, and removing duplicate or useless columns, it contains 531172 observations and 26 variables (user ID included).
The App data contains app usage information, including App name, device used, and time spent on each App, for each user on 2017-01-02. It originally contains 797563 observations and 8 variables. After removing duplicate or useless columns and unreasonable records, it contains 796545 observations and 6 variables (user ID included).
The Web data contains website usage information, including website name, device used, and time spent on each website, for each user on 2017-01-02. It originally contains 33521 observations and 8 variables. Records with TOTAL_MINUTES less than 1 are removed from the dataset. After removing these and other unreasonable records, along with duplicate or useless columns, it contains 15168 observations and 7 variables (user ID included).
I first wanted to see whether the number of TV viewers and the time spent watching differ by day of the week. The number of unique viewers on each day, the total time spent (in hours) on each day, and the average time spent (in hours) are calculated and summarized in the following table. I expected Friday and the weekend to have more viewers and a higher average time spent on TV than any day from Monday to Thursday.
Please note that the number of TV viewers is calculated by counting unique user IDs, so that no user is counted more than once.
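As an illustration of the counting rule in the note above, here is a minimal Python sketch of the three per-day figures. Python is used only for illustration; the record layout `(user_id, day, minutes)`, the function name, and the sample values are hypothetical, and the average is assumed to be total hours divided by the number of unique viewers.

```python
from collections import defaultdict

# Hypothetical sample records: (user_id, day_of_week, minutes_viewed).
records = [
    ("u1", "MON", 120),
    ("u1", "MON", 60),
    ("u2", "MON", 90),
    ("u2", "SAT", 240),
    ("u3", "SAT", 60),
]


def summarize_by_day(rows):
    """Return {day: (unique_viewers, total_hours, avg_hours_per_viewer)}."""
    users = defaultdict(set)      # a set keeps each user ID only once per day
    minutes = defaultdict(float)
    for user_id, day, mins in rows:
        users[day].add(user_id)
        minutes[day] += mins
    summary = {}
    for day, ids in users.items():
        total_hours = minutes[day] / 60
        summary[day] = (len(ids), total_hours, total_hours / len(ids))
    return summary


daily_summary = summarize_by_day(records)
```

Using a set per day means a user with several viewing records on the same day still counts as one viewer.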
From the above table we can see that in the week of 2017-01-02 the number of viewers is very similar on each day, at around 12k. Thursday, Sunday, and Friday have the most viewers.
The total time spent on TV ranges from around 25k to 35k hours per day, with Sunday (34907 hours) and Saturday (31565 hours) being the highest and second highest.
Sunday and Saturday also had the top two average view times: on average, a viewer spent 2.85 hours on Sunday and 2.62 hours on Saturday.
People watched TV about 30 minutes more per day on weekends (2.73 hrs) than on weekdays (2.23 hrs).
To my surprise, Monday has the third-highest total time spent (30531.14 hours) and also the third-highest average view time, 2.53 hours.
To see how audience behavior differs by the part of day in which a show airs, we need to know when viewers actually watched. However, since the dataset does not contain exact viewing times, I only analyzed LIVE viewers, whose viewing time can be inferred from the air time.
I first created a subset of LIVE viewers and then counted the number of users in each AIR_DAY_PART.
Please note that the number of TV viewers is calculated by counting unique user IDs, so that no user is counted more than once.
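The subset-then-count step can be sketched as follows. This is only a sketch under assumptions: the tuple layout is hypothetical, and LIVE records are assumed to be identifiable by a platform value of `"LIVE"`; the text does not state which column was actually used to build the subset.

```python
from collections import defaultdict

# Hypothetical rows: (user_id, platform, air_day_part).
rows = [
    ("u1", "LIVE", "Prime (8PM-11PM)"),
    ("u1", "LIVE", "Prime (8PM-11PM)"),
    ("u2", "LIVE", "Prime (8PM-11PM)"),
    ("u2", "OTT", "Daytime (9AM-3PM)"),
    ("u3", "LIVE", "Daytime (9AM-3PM)"),
]


def live_viewers_by_day_part(data):
    """Count unique users per day part, using LIVE records only."""
    viewers = defaultdict(set)
    for user_id, platform, day_part in data:
        if platform == "LIVE":  # assumed marker for live viewing
            viewers[day_part].add(user_id)
    return {part: len(ids) for part, ids in viewers.items()}


live_counts = live_viewers_by_day_part(rows)
```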
We can see that, during a day, most viewers watch LIVE TV during Prime time (8pm - 11pm), followed by Daytime (9am - 3pm) and Early Fringe (3pm - 5pm). Overnight has the fewest viewers of any day part.
If we calculate the average number of LIVE viewers on weekdays and weekends separately, as below, we can see that although prime hours were the peak TV viewing time on both weekend days and weekdays, the number of viewers was greater on weekend days from 9PM until 5AM. Put another way, the number of viewers on weekdays is greater than that on weekends only during Early Morning.
Comedy, Drama, Reality, News, and Children are the top 5 genres by number of viewers. Spanish language, Sports talk, and Art/Music are the bottom 3 genres.
Although Comedy has the largest audience (109k), only 47.9% of its viewers watched the shows LIVE. In contrast, although only 36k viewers watched Sports, 83% of them watched LIVE. Talk/Variety shows a similar pattern: of the 36k viewers who watched it, 72% watched LIVE.
Today’s program content is viewed on more than just television sets. Consumers are watching via the Internet and on mobile devices, in-home and out-of-home, live and time-shifted. Therefore, I would like to see how video view platform is distributed among our audience.
To approach this problem, I grouped the data by video platform category and then counted the number of unique users, identified by unique USERS_META_ID.
I chose a pie chart here because it shows both the proportion and the size of each population, which lets us compare the 4 populations easily. We can see that around 60% of viewers watched TV through either Live or DVR.
To approach this problem, I calculated the total time spent on each App category and then each category's proportion of the total time. Seventeen categories, such as Food&Drink, Family&Kids, and Educational, each took up less than 5% of the total time, so they are grouped as ‘Others’.
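The proportion calculation and the 5% cutoff described above could look like the sketch below. The row layout, the function name, and the sample minutes are hypothetical; only the 5% threshold and the ‘Others’ label come from the text.

```python
from collections import defaultdict

# Hypothetical rows: (app_category, minutes_used).
usage = [
    ("Social", 50),
    ("Entertainment&Lifestyle", 30),
    ("Browser", 16),
    ("Food&Drink", 3),
    ("Educational", 1),
]


def category_shares(rows, threshold=0.05, other_label="Others"):
    """Share of total time per category; small categories are pooled."""
    totals = defaultdict(float)
    for category, minutes in rows:
        totals[category] += minutes
    grand_total = sum(totals.values())

    shares = defaultdict(float)
    for category, minutes in totals.items():
        share = minutes / grand_total
        # Categories below the threshold are folded into one bucket.
        key = category if share >= threshold else other_label
        shares[key] += share
    return dict(shares)


shares = category_shares(usage)
```

Pooling after computing each category's own share ensures the threshold is applied per category rather than to the pooled bucket.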
The following pie chart shows the proportion of total time spent in each App category.
From the above pie chart we can see that Social (18.5%) and Entertainment&Lifestyle (13.8%) are the major App categories, followed by Email&Communication, Misc, and Browser.
I paid extra attention to Entertainment&Lifestyle Apps because this category contains streaming TV services and video platforms such as YouTube, Netflix, Hulu, Roku, HBO Go, and so on. Why is this important?
According to the Los Angeles Times, ‘while the majority of viewers watch the old-fashioned way — live and seated in front of a TV screen — new technologies are rapidly transforming the way programming is consumed. The upending of television is being led by digital video recorders, video on-demand and streaming sites such as Netflix, Hulu and Amazon that can be watched on mobile phones and tablets as well’.[https://www.latimes.com/entertainment/tv/la-et-st-tv-section-ratings-20141123-story.html]
This indicates that, as more viewers watch shows on their mobile devices, it becomes harder to track their watching behavior and measure their content consumption, and therefore more challenging to create a complete user profile.
Therefore, I looked at App consumption within Entertainment&Lifestyle to see whether our users spent much time on streaming services on their mobile devices.
There are many Apps within Entertainment&Lifestyle. Here I pick only a few popular streaming services and look at the time users spent on them.
On 2017-01-02,
This is not a complete list of streaming services and video platforms; however, it already shows that many users spent much time on them, taking market share from traditional TV viewing.
Cross-platform tracking makes it possible to get more accurate statistics and more comprehensive information about users, since their identities are no longer split across multiple platforms (cable TV, streaming services, etc.). In our case, if we had viewing data such as program information, view time, and elapsed time collected from streaming TV Apps and video platform Apps, we would be able to better analyze users’ viewing behavior and create customized marketing strategies for different audiences, e.g. recommending customized TV programs and sending promotional advertisements.
Note: Time is in minutes.
Most users spent time on websites between 9AM and 5PM, and on average each user also spent the most time during this period.
The results of the above analysis made me wonder whether there is a difference between web usage and TV usage across day parts. Therefore, I created bar charts to compare the two below:
From the above bar chart we can see that on 2017/01/02, our participants who watched LIVE TV spent much more time on TV than on websites in all day parts. Daytime is the most popular day part for both TV viewers and website visitors.
Please note that our website usage data covers only 2017-01-02, so the above conclusion can be very biased and may not be applicable to other times.
To answer this question, I calculated the total time spent, number of users, and average time spent on each website.
Our data is limited in the sense that we only know which websites users browsed but NOT the content they looked at (and I am not sure whether collecting that would be legal). For example, if we knew that a user checked Twitter and read discussions about a recent movie, it would be likely that this user is interested in the movie.
Given that at Viacom the focus of our business is to engage global audiences and deliver compelling content to our fans across all platforms, audience segmentation is a topic that deserves much focus. In this project, I measured audience behavior using program information from the TV data. To be more specific, I selected AIR_DAY_PART_DESC, AIR_DOW, TIMESHIFT_INDICATOR_DESC, MASTER_GENRE_DESC, VIDEO_VIEW_PLATFORM, SYNDICATION_GROUP, and PROGRAM_NAME from the TV data, as well as a new feature: the percentage of a show that was watched (ELAPSED_TIME/SHOW_DURATION). Viewers were clustered into the same group only if they had identical values for all of these features.
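The exact-match grouping described above can be sketched as below. The column names come from the text; the sample records, the shortened feature list, and the rounding of the watched percentage to one decimal place are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical viewing records; only a few of the listed columns are used
# here to keep the example short.
views = [
    {"USERS_META_ID": 1, "AIR_DOW": "MON", "VIDEO_VIEW_PLATFORM": "OTT",
     "PROGRAM_NAME": "Show A", "ELAPSED_TIME": 54, "SHOW_DURATION": 60},
    {"USERS_META_ID": 2, "AIR_DOW": "MON", "VIDEO_VIEW_PLATFORM": "OTT",
     "PROGRAM_NAME": "Show A", "ELAPSED_TIME": 53, "SHOW_DURATION": 60},
    {"USERS_META_ID": 3, "AIR_DOW": "MON", "VIDEO_VIEW_PLATFORM": "OTT",
     "PROGRAM_NAME": "Show A", "ELAPSED_TIME": 30, "SHOW_DURATION": 60},
]

FEATURES = ("AIR_DOW", "VIDEO_VIEW_PLATFORM", "PROGRAM_NAME")


def cluster_by_exact_match(rows, features, digits=1):
    """Group user IDs whose feature values (and rounded p) all match."""
    clusters = defaultdict(set)
    for row in rows:
        # p = share of the show watched; rounding is an assumption.
        p = round(row["ELAPSED_TIME"] / row["SHOW_DURATION"], digits)
        key = tuple(row[f] for f in features) + (p,)
        clusters[key].add(row["USERS_META_ID"])
    return dict(clusters)


clusters = cluster_by_exact_match(views, FEATURES)
```

Because membership requires every feature to match, the number of clusters grows quickly with the number of features, and many clusters will contain very few users.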
Due to time constraints, I only picked one cluster here as an example:
| USERS_META_ID | AIR_DAY_PART_DESC | AIR_DOW | TIMESHIFT_INDICATOR_DESC | MASTER_GENRE_DESC | VIDEO_VIEW_PLATFORM | SYNDICATION_GROUP | PROGRAM_NAME | p | Cluster_ID |
|---|---|---|---|---|---|---|---|---|---|
| 2120025 | Prime (8PM-11PM) | MON | On demand | Drama | OTT | Broadcast Network | Grey's Anatomy | 0.9 | 128020 |
| 2331905 | Prime (8PM-11PM) | MON | On demand | Drama | OTT | Broadcast Network | Grey's Anatomy | 0.9 | 128020 |
| 208481 | Prime (8PM-11PM) | MON | On demand | Drama | OTT | Broadcast Network | Grey's Anatomy | 0.9 | 128020 |
| 2367253 | Prime (8PM-11PM) | MON | On demand | Drama | OTT | Broadcast Network | Grey's Anatomy | 0.9 | 128020 |
| 2326427 | Prime (8PM-11PM) | MON | On demand | Drama | OTT | Broadcast Network | Grey's Anatomy | 0.9 | 128020 |
| 2260066 | Prime (8PM-11PM) | MON | On demand | Drama | OTT | Broadcast Network | Grey's Anatomy | 0.9 | 128020 |
We see that the above users all watched Grey’s Anatomy on streaming platforms during 8PM-11PM and finished 90% of the show. This might indicate that these users have similar watching behavior. But it could be a coincidence, since Grey’s Anatomy is very popular and many people watch TV during Prime time. Let’s pick two users, for example those with IDs 2120025 and 2367253, and see whether they also watched other shows within the same genres, that is, whether they share similar tastes in TV.
| Action/ Adventure/ SciFi | Art/Music | Awards & Specials | Children | Comedy | Documentary | Drama | Game show |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 13 | 0 | 13 | 0 |
| Instruction/ Information | News | Other | Reality | Spanish Language | Sports | Sports talk | Talk/Variety |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| DVR | LIVE | OTT | VOD |
|---|---|---|---|
| 0 | 12 | 16 | 0 |
From the above tables, we can see that USERS_META_ID 2120025 watched drama 13 times and comedy 13 times as well. Out of 28 records, s/he watched TV on streaming services 16 times and LIVE 12 times.
| Action/ Adventure/ SciFi | Art/Music | Awards & Specials | Children | Comedy | Documentary | Drama | Game show |
|---|---|---|---|---|---|---|---|
| 2 | 0 | 0 | 0 | 6 | 0 | 14 | 0 |
| Instruction/ Information | News | Other | Reality | Spanish Language | Sports | Sports talk | Talk/Variety |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 |
| DVR | LIVE | OTT | VOD |
|---|---|---|---|
| 1 | 7 | 19 | 0 |
From the above tables, we can see that USERS_META_ID 2367253 watched drama 14 times and comedy 6 times. Out of 27 records, s/he watched TV on streaming services 19 times and LIVE 7 times.
By comparing the genres and video platforms of these 2 users, we see that both prefer watching dramas and comedies on streaming platforms. Therefore, one marketing recommendation is to recommend content that user 2367253 watched to user 2120025 and vice versa, because they might like similar TV programs. Since both of them watched more on streaming platforms, promotions for streaming services can be considered for both.
We cannot say with certainty that two people who watched the same TV program and have similar genre preferences have exactly the same taste or watching behavior, because there are many other factors to consider. However, this can serve as a reference for audience segmentation and for developing further marketing strategies.
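The comparison above was made by eye. One possible way to put a number on it, which is not part of the analysis itself, is the cosine similarity between the two users' count vectors. The sketch below uses the non-zero genre and platform counts from the tables above; the function name and the choice of cosine similarity are assumptions for illustration.

```python
import math

# Non-zero counts copied from the tables above.
genres_2120025 = {"Action/Adventure/SciFi": 1, "Art/Music": 1,
                  "Comedy": 13, "Drama": 13}
genres_2367253 = {"Action/Adventure/SciFi": 2, "Comedy": 6,
                  "Drama": 14, "Reality": 5}
platforms_2120025 = {"LIVE": 12, "OTT": 16}
platforms_2367253 = {"DVR": 1, "LIVE": 7, "OTT": 19}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (dicts)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as sharing nothing
    return dot / (norm_a * norm_b)


genre_similarity = cosine_similarity(genres_2120025, genres_2367253)
platform_similarity = cosine_similarity(platforms_2120025, platforms_2367253)
```

Values close to 1 mean the two users spread their viewing across genres or platforms in similar proportions; the same caveat applies that similar counts do not prove identical taste.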
Partitioning around medoids is an iterative clustering procedure with the following steps:[https://www.r-bloggers.com/clustering-mixed-data-types-in-r/]
It was so time- and memory-consuming that I could not run it locally. I also tried running it on a free-tier AWS instance, but that failed as well because it ran out of memory.
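For reference, the general idea of medoid-based clustering (pick k medoids, assign every point to its nearest medoid, move each medoid to the member that minimises total within-cluster distance, repeat until nothing changes) can be sketched in a few lines. This is a simplified k-medoids loop, not the full PAM swap procedure, and the mismatch-count distance, the deterministic starting medoids, and the sample points are assumptions for the example. Each update compares members of a cluster pairwise, and implementations that precompute the full distance matrix need memory that grows with the square of the number of observations, which is one plausible reason for the memory problems with several hundred thousand records.

```python
def mismatch_distance(a, b):
    """Number of positions where two equal-length feature tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def assign(points, medoids, distance):
    """Map each medoid index to the indices of the points nearest to it."""
    clusters = {m: [] for m in medoids}
    for i, p in enumerate(points):
        nearest = min(medoids, key=lambda m: distance(p, points[m]))
        clusters[nearest].append(i)
    return clusters


def k_medoids(points, k, distance, max_iter=100):
    medoids = list(range(k))  # assumed start: the first k points
    for _ in range(max_iter):
        clusters = assign(points, medoids, distance)
        # New medoid = member with the smallest total distance to the rest.
        new_medoids = [
            min(members,
                key=lambda c: sum(distance(points[c], points[j])
                                  for j in members))
            if members else m
            for m, members in clusters.items()
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids, assign(points, medoids, distance)


# Hypothetical categorical records: (day part, genre, platform).
points = [
    ("Prime", "Drama", "OTT"),
    ("Prime", "Drama", "LIVE"),
    ("Daytime", "News", "LIVE"),
    ("Prime", "Comedy", "OTT"),
    ("Daytime", "News", "DVR"),
    ("Daytime", "Sports", "LIVE"),
]
medoids, clusters = k_medoids(points, 2, mismatch_distance)
```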
1. Audience Info
If we have viewing data along with audiences’ demographic and psychographic profiles, we would be able to better picture our audiences and therefore build a more accurate audience segmentation model. For example, if two people B and C have the same gender, age group, and occupation, as well as some overlapping TV program subscriptions, it is likely that they share similar tastes in TV shows. A TV show that B likes might also be interesting to C and vice versa. Therefore, based on user similarities, we can tailor content recommendations for users.
2. Social Media Data
Social media data can also help with audience segmentation. For example, if we know that person A liked a Facebook page about a sitcom that will air LIVE next month, it is likely that A likes this sitcom and wants to watch the show. Therefore, we can send A relevant offers and advertisements.
Another use of social media is sentiment analysis on users’ comments about TV programs to see their opinions of them. If we have enough review data, we can learn the preferences of each person or group and therefore recommend tailored content.
3. Purchasing Behavior Data
If we have users’ purchasing behavior data, we can estimate how much a viewer is worth to our company by calculating customer lifetime value (CLV) by means of the Recency-Frequency-Monetary framework. We can then cluster users in terms of their CLV and loyalty and create targeted marketing plans (ads, campaigns, etc.) for different user groups.
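As a rough sketch of the first step in such a framework, the three Recency-Frequency-Monetary values can be derived per user from a purchase log. The purchase records, the function name, and the reference date are hypothetical, and how these values would then be turned into a CLV estimate is left open.

```python
from collections import defaultdict
from datetime import date

# Hypothetical purchase log: (user_id, purchase_date, amount).
purchases = [
    ("u1", date(2017, 1, 1), 20.0),
    ("u1", date(2016, 12, 15), 35.0),
    ("u2", date(2016, 10, 3), 120.0),
]


def rfm_table(rows, as_of):
    """Return {user: (recency_days, frequency, monetary)}."""
    last_purchase = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for user, day, amount in rows:
        last_purchase[user] = max(last_purchase.get(user, day), day)
        frequency[user] += 1
        monetary[user] += amount
    return {
        user: ((as_of - last_purchase[user]).days,
               frequency[user],
               monetary[user])
        for user in last_purchase
    }


rfm = rfm_table(purchases, as_of=date(2017, 1, 2))
```

Lower recency and higher frequency and monetary values generally point to more valuable users, so the three numbers can feed directly into the clustering step mentioned above.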
Having found clusters of users who share similar watching behavior and content consumption in the first part of Audience Segmentation, the next step is to see whether these users are also similar in terms of App usage and Website usage.
Since one user may have multiple devices, they may prefer different devices for different tasks. For example, a phone is mainly used for communication, while a tablet is used more for entertainment, including gaming, videos, and so on. Cross-device tracking enables us to get more accurate statistics and more comprehensive information about users.